This document presents the primary analyses of the VIBRANT trial.
Currently, this runs on simulated data, but the code is mostly ready to be run on the real data once it is available.
The simulated data is currently relatively realistic (but still needs some improvement), and is very optimistic (assumes a very successful trial).
Until we have the actual data, we intend to improve the simulated data (= make them more realistic) by
Model the metagenomic data from an actual file provided by Jacques’ lab (currently, we “invented” the format, but it would be much better to have an actual pilot example).
Add demographic variables
Add dropouts (participants who left the study before the endpoint), missed visits or missed data (failed assay or uncollected sample)
Add replacements
Add differences in LBP strain survival rates between sites, arms, and participants
Add inter-individual variability in how long the LBP strains survive
Add bacterial species (currently, we only have the 15 LBP strains, “other” L. crispatus, L. iners, and “non-Lacto” (lumped together))
(add contamination)
Consequently, some (supplementary) analyses are still missing
Demographics (Table 1)
Sensitivity analyses for the different populations (ITT, mITT, PP)
… + search for “TODO” or “XXX” within the document (placeholders for analyses that still need to be implemented or text that needs to be written)
These will be implemented once the corresponding improvement steps detailed above are themselves implemented.
Even with all these improvements, there will, inevitably, be un-anticipated issues with the real data, but the core of the methods (and corresponding code) will likely be unchanged.
Workflow
The document is organized as follow
Data preparation: load the raw data as provided by labs/teams, and collate them into a neat MultiAssayExperiment object containing all the data linked by participant x visit.
Checks, QC, and Outcomes Definitions: check the data for potential issues, perform quality controls (e.g, check for library sizes, potential contamination, etc.), and define outcomes of interest (colonization status, etc.)
Analyses: perform the primary analyses of the trial, including demographics, primary and secondary outcomes, and sensitivity analyses. This section has the main tables and figures.
Data preparation
Loading the data and creating SummarizedExperiment objects for each assays
In this section, we load the raw data from their original files (i.e., the data/files as generated by the corresponding labs or teams) and transform each assay into a SummarizedExperiment object. This will enable to then collect them into a single MultiAssayExperiment object and perform the integrative analyses.
To link the assays together,we define a common unique identifier for each participant x visit. This unique identifier (.sample) is the concatenation of the pid (participant ID) and the visit_code.
Clinical and survey (CRF) data
Information about the data
The data acquisition process is described XXXHEREXXXX.
The data encoding in a xxx database and quality control processes are described XXXHEREXXX.
The data was exported from the database by Lara Lewis (CAPRISA, South Africa) as a set of xxx files.
Lara Wautier (UCLouvain, Belgium) then imported these files in R and assembled the data into two tables: a participants and a visits table. The code is available XXXHEREXXX.
The participants table contains participant-invariant information such as the participant ID, the treatment arm, site, and study population (ITT, mITT, PP). The visits table contains participant-varying information collected at each visit.
The columns necessary for the primary trial analyses were then selected from these two and exported into two files: participants.RDS and visits.RDS, which we load below.
Loading the data
We load the participant and visits data, and create
one table which will become the MAE’s colData
one SE object which provides attendance information at each visit (this also ensures that the MAE object has data for all visits, which is useful for downstream analyses)
se_16S |>as_tibble() |>group_by(.feature) |>mutate(total_prop =sum(ASV_prop)) |>ungroup() |>arrange(-total_prop) |>mutate(.feature = .feature |>fct_inorder()) |>ggplot(aes(x = .feature, y = .sample, fill =asinh(ASV_counts))) +geom_tile() +scale_fill_continuous(low ="white", high ="steelblue2") +scale_x_discrete("ASVs", breaks =NULL) +scale_y_discrete("Samples",breaks =NULL)
Integrating all assays into a MultiAssayExperiment object
In this section, we assemble all SE objects created in the previous section into a single MultiAssayExperiment object.
colData
Code
mae_coldata |>as_tibble()
# A tibble: 4,738 × 20
sample_id site arm pid age race food_insecurity contraception
<chr> <fct> <fct> <fct> <int> <chr> <chr> <chr>
1 p1_0 CAP Pl p1 25 White No Non-hormonal method
2 p1_1000 CAP Pl p1 25 White No Non-hormonal method
3 p1_1001 CAP Pl p1 25 White No Non-hormonal method
4 p1_1002 CAP Pl p1 25 White No Non-hormonal method
5 p1_1003 CAP Pl p1 25 White No Non-hormonal method
6 p1_1004 CAP Pl p1 25 White No Non-hormonal method
7 p1_1005 CAP Pl p1 25 White No Non-hormonal method
8 p1_1006 CAP Pl p1 25 White No Non-hormonal method
9 p1_1007 CAP Pl p1 25 White No Non-hormonal method
10 p1_1100 CAP Pl p1 25 White No Non-hormonal method
# ℹ 4,728 more rows
# ℹ 12 more variables: gender_sexual_partners <chr>,
# nb_partner_pastmonth <dbl>, nb_partner_life <dbl>, dropout <lgl>,
# dropout_time <int>, visit_code <dbl>, visit_abbr <fct>, visit_name <fct>,
# visit_number <dbl>, day <dbl>, attended <lgl>, planned <lgl>
A MultiAssayExperiment object of 3 listed
experiments with user-defined names and respective classes.
Containing an ExperimentList class object of length 3:
[1] visit_attendance: SummarizedExperiment with 2 rows and 4738 columns
[2] mg_ksanity: SummarizedExperiment with 16 rows and 927 columns
[3] amplicon_ASV_all: SummarizedExperiment with 20 rows and 927 columns
Functionality:
experiments() - obtain the ExperimentList instance
colData() - the primary/phenotype DataFrame
sampleMap() - the sample coordination DataFrame
`$`, `[`, `[[` - extract colData columns, subset, or experiment
*Format() - convert into a long or wide DataFrame
assays() - convert ExperimentList to a SimpleList of matrices
exportClass() - save data to flat files
Checks, QC, and Outcomes Definitions
Checks and Quality Controls
Visualization of available data for each assay and each site x arm x participant x visit
# A tibble: 1,854 × 13
.feature .sample attended assay pid site arm visit_code visit_abbr
<chr> <chr> <lgl> <fct> <fct> <fct> <fct> <dbl> <fct>
1 attended p1_1000 TRUE mg_ksanity p1 CAP Pl 1000 ENR
2 attended p1_1000 TRUE amplicon_A… p1 CAP Pl 1000 ENR
3 attended p1_1100 FALSE mg_ksanity p1 CAP Pl 1100 Post-ABX
4 attended p1_1100 FALSE amplicon_A… p1 CAP Pl 1100 Post-ABX
5 attended p1_1200 TRUE mg_ksanity p1 CAP Pl 1200 WFU
6 attended p1_1200 TRUE amplicon_A… p1 CAP Pl 1200 WFU
7 attended p1_1300 TRUE mg_ksanity p1 CAP Pl 1300 WFU
8 attended p1_1300 TRUE amplicon_A… p1 CAP Pl 1300 WFU
9 attended p1_1400 TRUE mg_ksanity p1 CAP Pl 1400 WFU
10 attended p1_1400 TRUE amplicon_A… p1 CAP Pl 1400 WFU
# ℹ 1,844 more rows
# ℹ 4 more variables: visit_name <fct>, visit_number <dbl>,
# available_data <lgl>, assay_scheduled_at_visit <lgl>
Code
available_data |>filter(attended ==TRUE) |># ou available_data == TRUEggplot(aes(x = assay, y = pid, col = assay)) +geom_point() +facet_grid(arm + site ~ visit_code, scales ="free", space ="free") +theme(axis.text.x =element_text(angle =90, hjust =1, vjust =0.5), legend.position ="none")
Checking for contamination
While it is possible for a few Placebo participants to natively have strains that are extremely similar to the LBP strains, it is highly unlikely for many of the LBP strains to be present simultaneously in a single sample for a placebo participant. Such event would much more likely be the result of contamination OR a mislabeling of the sample OR a randomization issue.
We thus check how many LBP strains are simultaneously present in the Placebo participants samples
TODO
Code
mae[["mg_ksanity"]] |>left_join(colData(mae) |>as.tibble() |> dplyr::select(.sample, site, arm, visit_code, pid), by =".sample" ) |>filter(arm =="Pl"&!is.na(LBP)) |>ggplot(aes(x = pid, y = unique_kmers, color = strain)) +geom_point() +facet_grid(visit_code ~ strain, scales ="free", space ="free") +theme(axis.text.x =element_text(angle =90, hjust =1, vjust =0.5))
Other checks ?
Outcomes definitions
Primary outcome definition
The primary outcome is XXX INSERT DEFINITION XXX
Consequently, we need to compute the relative abundance of each LBP strains from their proportions out of total L. crispatus proportions and the total L. crispatus proportions as estimated by 16S sequencing.
From this, we can compute the total relative abundance of all LBP strains or the maximum relative abundance of any LBP strain in each sample.
If the total relative abundance of all LBP strains is larger than 0.1 or the maximum relative abundance of any LBP strain is larger than 0.05, we consider that the LBP achieved colonization at that visit.
From the colonization status at each visit, we can compute the colonization status by each visit. A participant is considered to have colonized by a visit if they have been colonized at that visit or any previous visit.
16S rRNA microbiota composition, aggregated at the species level
TODO
16S rRNA microbiota composition, augmented with LBP strains relative abundances
We combine the 16S rRNA microbiota composition with the LBP strains relative abundances to create a new assay mb_composition.
TODO
We now have an “augmented” MAE object with the following assays:
Code
mae
A MultiAssayExperiment object of 5 listed
experiments with user-defined names and respective classes.
Containing an ExperimentList class object of length 5:
[1] visit_attendance: SummarizedExperiment with 2 rows and 4738 columns
[2] mg_ksanity: SummarizedExperiment with 16 rows and 927 columns
[3] amplicon_ASV_all: SummarizedExperiment with 20 rows and 927 columns
[4] LBP_rel_ab: SummarizedExperiment with 15 rows and 927 columns
[5] LBP_colonization: SummarizedExperiment with 4 rows and 927 columns
Functionality:
experiments() - obtain the ExperimentList instance
colData() - the primary/phenotype DataFrame
sampleMap() - the sample coordination DataFrame
`$`, `[`, `[[` - extract colData columns, subset, or experiment
*Format() - convert into a long or wide DataFrame
assays() - convert ExperimentList to a SimpleList of matrices
exportClass() - save data to flat files
Analyses
Demographics (Table 1)
In this section we will create a table with the demographics of the participants in the study.
If we have 0 successes in none of the Placebo arm, we won’t be able to use a logistic regression model that accounts for site differences since we have a “perfect separation issue” and the odds are infinite for the Placebo arm:
Code
fit <-glm(colonized_by ~ arm + site + arm:site, data = col_by_week5, family = binomial)fit |>summary()
Call:
glm(formula = colonized_by ~ arm + site + arm:site, family = binomial,
data = col_by_week5)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.857e+01 1.967e+03 -0.009 0.992
armLC-106-7 1.787e+01 1.967e+03 0.009 0.993
armLC-106-3 1.955e+01 1.967e+03 0.010 0.992
armLC-106-o 2.096e+01 1.967e+03 0.011 0.991
armLC-115 1.913e+01 1.967e+03 0.010 0.992
siteMGH 7.888e-11 2.723e+03 0.000 1.000
armLC-106-7:siteMGH 1.386e+00 2.723e+03 0.001 1.000
armLC-106-3:siteMGH 1.178e-01 2.723e+03 0.000 1.000
armLC-106-o:siteMGH 1.617e+01 4.248e+03 0.004 0.997
armLC-115:siteMGH -7.419e-01 2.723e+03 0.000 1.000
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 142.779 on 102 degrees of freedom
Residual deviance: 88.224 on 93 degrees of freedom
AIC: 108.22
Number of Fisher Scoring iterations: 17
There are a few ways to remedy that:
One way is to use a “pseudo-Bayesian” approach and to add 1 artificial success to all arms, but this introduces a small “defavorable” bias in the odds ratio estimates (and theoretically increases the p-values).
Code
tmp <- col_by_week5 |>bind_rows( col_by_week5 |> dplyr::select(arm, site) |>distinct() |>mutate(colonized_by =TRUE), col_by_week5 |> dplyr::select(arm, site) |>distinct() |>mutate(colonized_by =FALSE) ) |>filter(! (arm %in%c("LC-106-3", "LC-106-o")))fit <-glm(colonized_by ~ arm + site + arm:site, data = tmp, family = binomial)fit |>summary()
Call:
glm(formula = colonized_by ~ arm + site + arm:site, family = binomial,
data = tmp)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -2.48491 1.04083 -2.387 0.0170 *
armLC-106-7 1.87877 1.15798 1.622 0.1047
armLC-115 2.95491 1.18673 2.490 0.0128 *
siteMGH -0.08004 1.46978 -0.054 0.9566
armLC-106-7:siteMGH 1.27397 1.65195 0.771 0.4406
armLC-115:siteMGH -0.54411 1.67176 -0.325 0.7448
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 110.619 on 83 degrees of freedom
Residual deviance: 89.848 on 78 degrees of freedom
AIC: 101.85
Number of Fisher Scoring iterations: 5
Another way is to use an actual Bayesian model that accounts for site differences, but this does not provide p-values, only posterior distributions (and confidence intervals).
Code
col_by_week5_agg <- col_by_week5 |>group_by(site, arm) |>summarise(success = colonized_by |>sum(), ntrials =n())bfit <-brm( success |trials(ntrials) ~ arm + site + arm:site, data = col_by_week5_agg, family =binomial("logit") )
Alternatively, we can use exact fisher tests (as initially proposed by Lara Lewis, CAPRISA) for each arm x site, correcting for multiple testing, but this does not allow us to directly test for site differences.